Corpus Construction Tools

Author

  • Radovan Garabík
Abstract

Modern developments in computing allow us to take part in directions of scientific research into natural language that were previously impossible. The fundamental resource needed is language corpora, including large representative (national) corpora. General-purpose software tools for efficiently processing large amounts of text are already widely available, as are corpus search tools. Nevertheless, building a corpus from a large amount of data requires a definite plan for organising the text processing, together with a software architecture. The paper presents a general system that makes it possible to quickly accommodate the language-specific aspects of data processing for a particular language. The necessary aspects of a national corpus are discussed from both a linguistic and a computational point of view. The system predominantly uses the modern object-oriented programming language Python, which has excellent text-processing capabilities. Text annotation consists of two parts: linguistic (internal) annotation, which is an intrinsic property of the linguistic units (words) in the text, and general information about the documents (metatextual, external annotation). The internal annotation is embedded directly in the format of the processed texts, using existing standards for representing textual data such as XML (XCES). The external annotation is stored in plain text files, with a relational database built on top of this structure (a schematic sketch of this two-level scheme follows the introduction below).

Introduction

There exists a reasonably extensive literature concerning the principles of corpus structure and end-user interaction [1, 2, 3, 4 and many others]. However, the technical details of corpus construction are usually left out as uninteresting or too closely tied to a specific corpus, and therefore not applicable in general. As with every big project, creating and maintaining an extensive (i.e. “national”) corpus of written language requires a carefully thought-out design of the data structures and of the data manipulation. Consequently, each newly created big corpus ends up reinventing the wheel and implementing the data workflow and manipulation from scratch. During the construction of the Slovak National Corpus we did essentially the same thing, but we tried to make our design general and clean, so that it may serve as an inspiration for other big corpora yet to be created. This does not cover end-user information searching via a corpus manager – there are several (though not many) ...
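The abstract describes a two-level annotation scheme: linguistic annotation embedded in the XML (XCES) representation of the texts, and document metadata kept in plain text files that are mirrored in a relational database, with Python as the main implementation language. The sketch below only illustrates that scheme; the element names, metadata keys and morphological tags are assumptions made for this example and are not taken from the Slovak National Corpus code.

# Illustrative sketch only: the element names, metadata keys and the
# morphological tags below are invented for this example; they merely mimic
# the two-level scheme described in the abstract (XCES-like inline
# annotation plus plain-text document metadata mirrored in a database).

import sqlite3
from xml.etree import ElementTree as ET


def annotate_internal(tokens):
    """Wrap (word, lemma, tag) triples in a minimal XCES-like XML fragment."""
    sentence = ET.Element("chunk", type="s")       # one sentence chunk
    for word, lemma, tag in tokens:
        tok = ET.SubElement(sentence, "tok")
        ET.SubElement(tok, "orth").text = word     # surface form
        lex = ET.SubElement(tok, "lex")
        ET.SubElement(lex, "base").text = lemma    # lemma
        ET.SubElement(lex, "ctag").text = tag      # morphosyntactic tag
    return ET.tostring(sentence, encoding="unicode")


def write_external(path, metadata):
    """Store document metadata as simple 'key = value' lines in a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for name, value in metadata.items():
            f.write(f"{name} = {value}\n")


def register_in_db(conn, doc_id, metadata):
    """Mirror the plain-text metadata in a relational table for querying."""
    conn.execute("CREATE TABLE IF NOT EXISTS documents "
                 "(doc_id TEXT, name TEXT, value TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                     [(doc_id, n, v) for n, v in metadata.items()])
    conn.commit()


if __name__ == "__main__":
    print(annotate_internal([("Slovenský", "slovenský", "AAms1"),
                             ("národný", "národný", "AAms1"),
                             ("korpus", "korpus", "SSms1")]))
    meta = {"title": "Example document", "genre": "newspaper", "year": "2004"}
    write_external("doc0001.meta", meta)
    register_in_db(sqlite3.connect(":memory:"), "doc0001", meta)

A plausible motivation for such a split is that plain-text metadata files remain human-readable and easy to keep under version control, while the relational mirror makes ad hoc querying over document attributes cheap.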

Similar resources

Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis

This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor “weeping is a means of liberating contained emotions” is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...

Construction and utilization of bilingual speech corpus for simultaneous machine interpretation research

This paper describes the design, analysis and utilization of a simultaneous interpretation corpus. The corpus has been constructed at the Center for Integrated Acoustic Information Research (CIAIR) of Nagoya University in order to promote the realization of the multi-lingual communication supporting environment. The size of transcribed data is about 1 million words, and the corpus would deserve...

Chinese-English Parallel Corpus Construction and its Application

Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, and Chinese-English language research and teaching. But so far a large-scale Chinese-English corpus remains unavailable, given the difficulties and the intensive labour required. In this paper, our work towards building a large-scale Chinese-Engli...

Introduction of KIBS (Korean Information Base System) Project

This project has been carried out on the basis of resources and tools for Korean NLP. The main research is the construction of a raw corpus of 64 million tokens and a part-of-speech tagged corpus of about 11 million tokens. We also develop analytic tools to construct these corpora and supporting tools to navigate them. This paper presents the present state of the work carried out by the KIBS project. ...

GeoCorpora: building a corpus to test and train microblog geoparsers

In this article, we present the GeoCorpora corpus building framework and software tools as well as a geo-annotated Twitter corpus built with these tools to foster research and development in the areas of microblog/Twitter geoparsing and geographic information retrieval. The developed framework employs crowdsourcing and geovisual analytics to support the construction of large corpora of text in ...

BootCaT: Bootstrapping Corpora and Terms from the Web

This paper introduces the BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web. The procedure requires only a small set of seed terms as input. The seeds are used to build a corpus via automated Google queries, and more terms are extracted from this corpus. In turn, these new terms are used as seeds to build a larg...
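The blurb above outlines an iterative bootstrap: seed terms are combined into web queries, the retrieved pages form a corpus, new terms are extracted from that corpus, and those terms become the next round of seeds. The Python sketch below only illustrates that loop; BootCaT itself is a set of Perl programs, and search_web() and extract_terms() here are placeholders standing in for real web search and term extraction.

# Schematic sketch of the bootstrap loop described above; not the original
# BootCaT Perl tools. search_web() and extract_terms() are placeholder
# implementations standing in for real web search and term extraction.

import itertools
import re
from collections import Counter


def search_web(query, n_pages=5):
    """Placeholder: a real implementation would issue a web search and
    return the text of the top hits for the query."""
    return [f"dummy page text for query: {' '.join(query)}"] * n_pages


def extract_terms(corpus, known_terms, n_new=10):
    """Placeholder term extraction: pick the most frequent unseen words."""
    words = Counter(re.findall(r"\w+", " ".join(corpus).lower()))
    for term in known_terms:
        words.pop(term, None)
    return [w for w, _ in words.most_common(n_new)]


def bootstrap(seeds, iterations=3, tuple_size=3):
    """Iteratively grow a corpus and a term list from a handful of seeds."""
    terms, corpus = set(seeds), []
    for _ in range(iterations):
        # combine current terms into small query tuples
        queries = itertools.combinations(sorted(terms), tuple_size)
        for query in itertools.islice(queries, 10):   # cap queries per round
            corpus.extend(search_web(query))
        terms.update(extract_terms(corpus, terms))
    return corpus, terms


if __name__ == "__main__":
    corpus, terms = bootstrap(["corpus", "annotation", "tokenisation"])
    print(len(corpus), "pages,", len(terms), "terms")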

Journal:

Volume   Issue

Pages  -

Publication date: 2006